library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## 
## The following object is masked from 'package:dplyr':
## 
##     collapse
## 
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
cyto <- read_csv('data/HOT_cyto_counts_edit.csv') %>%
  select(!'...1')
## New names:
## Rows: 1217 Columns: 10
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," dbl
## (10): ...1, botid, date, press, chl, hbact, pro, syn, picoeuk, cruise
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
mlds <- read_csv('data/hot_mlds.csv')
## Rows: 160 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (7): crn, date, julian, n, mean, stdDev, stdErr
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We will focus only on pro, hbact, and picoeuk, and explore how the niches of these three groups differ from one another.

“botid” (bottle ID), “date” (date, in mmddyy format), “press” (pressure in decibars) “chl” (fluorometric chlorophyll a concentration, in micrograms L-1), “hbact” (heterotrophic bacteria concentration, in 105 cells mL-1), “pro” (Prochlorococcus concentration, in 105 cells mL-1), “syn” (Synechococcus concentration, in 105 cells mL-1), “picoeuk” (photosynthetic picoeukaryote concentration, in 105 cells mL-1), and “cruise” (cruise ID). Note that pressure in decibars is approximately equal to depth in meters (i.e., depth of the sampling device is measured using a pressure sensor).

#1.

To create a day of the year predictor you will need to first convert the ‘date’ column to date format, and then make a new column that extracts the day of the year from the date column. There are helpful functions in the package ‘lubridate’ that will do these steps for you.

# convert date column to date format - then extract day of year from date column
cyto <- cyto %>%
  mutate(date = mdy(date),
         day = yday(date)
         )

When fitting the 2D smoother for each type of microbe, consider how the response variable should be modeled (transformed or not, normal or non-normal). You can see the probability distributions available for the gam() function in package mgcv by looking at the help file titled ‘family.mgcv’. - assess normality - assess potential leverage of outlier points

# look at dist for each of three response variables
hist(cyto$pro) # bimodal? square root transform - possible high leverage to right

hist(sqrt(cyto$pro)) # this still isn't normal, dist appears more even now tho

hist(log(cyto$pro)) # heavily skewed, don't use

hist(cyto$hbact) # normal - all good

hist(cyto$picoeuk) # Poisson? square root transform

hist(sqrt(cyto$picoeuk)) # now appears normal, looks good!

incorporate response variable transformations

# square root transform pro
cyto$sqrt_pro <- sqrt(cyto$pro)

# square root transform picoeuk
cyto$sqrt_picoeuk <- sqrt(cyto$picoeuk)

For each of the three groups of microbes (pro, hbact, and picoeuk), fit a 2D smoother that characterizes how abundance changes with depth and with day of the year (i.e., from day 1 to day 365 or 366).

sqrt_pro_gam

# sqrt_pro gam
gam_sqrt_pro = gam(sqrt_pro ~ s(press, day), data = cyto) # smoothing off press, day

gam.check(gam_sqrt_pro) # residuals look perfect, k = 29

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 5.08649e-07 .
## The Hessian was positive definite.
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(press,day) 29.0 27.4    1.07       1
summary(gam_sqrt_pro) # edf 27.4 - 90% dev explained
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## sqrt_pro ~ s(press, day)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.037586   0.004633   223.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(press,day) 27.39  28.84 375.4  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.899   Deviance explained = 90.1%
## GCV = 0.026751  Scale est. = 0.026127  n = 1217
plot(gam_sqrt_pro) # not sure how to interpret these

# large peak in sqrt_pro around press of 70 and from roughly March to Oct?

hbact gam

# hbact gam
gam_hbact = gam(hbact ~ s(press, day), data = cyto) # smoothing off press, day

gam.check(gam_hbact) # residuals look great, k = 29

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 6 iterations.
## The RMS GCV score gradient at convergence was 2.491103e-06 .
## The Hessian was positive definite.
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value  
## s(press,day) 29.0 22.8    0.96   0.065 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam_hbact) # edf 22.9 - 74% dev explained
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## hbact ~ s(press, day)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.88022    0.01811   214.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(press,day) 22.85  26.88 123.2  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.732   Deviance explained = 73.7%
## GCV = 0.40709  Scale est. = 0.39911   n = 1217
plot(gam_hbact) # small peak for hbact press of 30, in roughly Sep

sqrt_picoeuk_gam

# sqrt_picoeuk gam
gam_sqrt_picoeuk = gam(sqrt_picoeuk ~ s(press, day), data = cyto) # smoothing off press, day

gam.check(gam_sqrt_picoeuk) # residuals look good, k = 29

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 5 iterations.
## The RMS GCV score gradient at convergence was 6.964021e-07 .
## The Hessian was positive definite.
## Model rank =  30 / 30 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(press,day) 29.0 24.3    1.03    0.82
summary(gam_sqrt_picoeuk) # edf 24.3 - only 52% dev explained
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## sqrt_picoeuk ~ s(press, day)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0933431  0.0006991   133.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df     F p-value    
## s(press,day) 24.3  27.72 45.83  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.514   Deviance explained = 52.3%
## GCV = 0.00060737  Scale est. = 0.00059475  n = 1217
plot(gam_sqrt_picoeuk) # not sure how to interpret these

# several small peaks in sqrt_picoeuk - deep early, shallow mid, deep late in year

Consider whether the basis dimension needs to be increased beyond the default value. - default edf values: sqrt_pro: 27.4, hbact: 22.9, sqrt_picoeuk: 24.3 - default k values for all: 29 - double k for each gam

double k for sqrt_pro

gam_sqrt_pro_k = gam(sqrt_pro ~ s(press, day, k = 60), data = cyto) 

gam.check(gam_sqrt_pro_k) # residuals look good, k = 59 - diff than default

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 5 iterations.
## The RMS GCV score gradient at convergence was 7.900007e-07 .
## The Hessian was positive definite.
## Model rank =  60 / 60 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(press,day) 59.0 38.7    1.08       1
summary(gam_sqrt_pro_k) # edf 38.7 - 90% dev explained
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## sqrt_pro ~ s(press, day, k = 60)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.037586   0.004635   223.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(press,day) 38.66   48.7 221.5  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.899   Deviance explained = 90.2%
## GCV = 0.027022  Scale est. = 0.026141  n = 1217
plot(gam_sqrt_pro_k)

note that doubling the k value for sqrt_pro gam produces different result - suggests that default settings did not allow for enough complexity in the model - although the plot appears identical?

double k for hbact

gam_hbact_k = gam(hbact ~ s(press, day, k = 60), data = cyto)

gam.check(gam_hbact_k) # residuals look great, k = 59

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 5.820414e-13 .
## The Hessian was positive definite.
## Model rank =  60 / 60 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value  
## s(press,day) 59.0 27.8    0.97    0.06 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam_hbact_k) # edf 27.8 - 74% dev explained
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## hbact ~ s(press, day, k = 60)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.8802     0.0181   214.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(press,day) 27.83  37.42 88.35  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.732   Deviance explained = 73.8%
## GCV = 0.40845  Scale est. = 0.39877   n = 1217
plot(gam_hbact_k) # small peak for hbact press of 30, in roughly Sep

same as above, doubling k changes the edf value suggesting default settings didn’t adequately account for necessary complexity in the model - but the plot is pretty much identical

double k for sqrt_picoeuk

gam_sqrt_picoeuk_k = gam(sqrt_picoeuk ~ s(press, day, k = 60), data = cyto) 

gam.check(gam_sqrt_picoeuk_k) # residuals look good, k = 59 - diff than default

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 4 iterations.
## The RMS GCV score gradient at convergence was 1.08866e-07 .
## The Hessian was positive definite.
## Model rank =  60 / 60 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                k'  edf k-index p-value
## s(press,day) 59.0 37.7    1.07       1
summary(gam_sqrt_picoeuk_k) # edf 37.7 - 54% dev explained - basically the same
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## sqrt_picoeuk ~ s(press, day, k = 60)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0933431  0.0006909   135.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df    F p-value    
## s(press,day) 37.72  47.84 27.9  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.525   Deviance explained =   54%
## GCV = 0.00059994  Scale est. = 0.00058085  n = 1217
plot(gam_sqrt_picoeuk_k)

in this case it doesn’t appear that doubling k did much for the model - the final edf value was basically equal to the default settings - and again, plot appears almost identical to default settings

Plot the fitted smoother in a way that is visually appealing.

plot(gam_sqrt_pro, select = 1, scheme = 2, lwd = 2)

plot(gam_hbact, select = 1, scheme = 2, lwd = 2)

plot(gam_sqrt_picoeuk, select = 1, scheme = 2, lwd = 2)

Finally, figure out how to test whether the relationship between abundance and depth changes over time or not. - already have a 2D smoother which models the interaction between two vars - so now create 1D smoother which would not incorporate interaction - then compare R2 of these two models for each response var

sqrt_pro - looks constant throughout the year, intermediate depth peak - 2D model fits marginally better

# sqrt_pro 1D gam
gam_sqrt_pro_1 = gam(sqrt_pro ~ s(press) + s(day), data = cyto)

gam.check(gam_sqrt_pro_1) # residuals look good, k = 9

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 2.248537e-07 .
## The Hessian was positive definite.
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'  edf k-index p-value    
## s(press) 9.00 5.98    1.03    0.86    
## s(day)   9.00 5.59    0.41  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam_sqrt_pro_1) # 89% dev explained
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## sqrt_pro ~ s(press) + s(day)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.037586   0.004936   210.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df       F p-value    
## s(press) 5.977  6.773 1373.56  <2e-16 ***
## s(day)   5.588  6.727   15.45  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.885   Deviance explained = 88.7%
## GCV = 0.029956  Scale est. = 0.029646  n = 1217
plot(gam_sqrt_pro_1)

# for sqrt_pro, doesn't look like day has a large impact
# but definitely see peak of sqrt_pro at intermediate depth, then drop off

hbact - see two peaks throughout the year, intermediate depth peak - 2D model fits marginally better

# hbact 1D gam
gam_hbact_1 = gam(hbact ~ s(press) + s(day), data = cyto)

gam.check(gam_hbact_1) # residuals look great, k = 9

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 8 iterations.
## The RMS GCV score gradient at convergence was 2.191183e-07 .
## The Hessian was positive definite.
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'  edf k-index p-value    
## s(press) 9.00 4.51    0.99    0.33    
## s(day)   9.00 7.79    0.41  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam_hbact_1) # 71% dev explained
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## hbact ~ s(press) + s(day)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.88022    0.01892   205.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value    
## s(press) 4.511  5.437 518.7  <2e-16 ***
## s(day)   7.791  8.617  14.6  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.707   Deviance explained =   71%
## GCV = 0.44049  Scale est. = 0.43568   n = 1217
plot(gam_hbact_1)

# two peaks for day
# intermediate depth peak

sqrt_picoeuk - sees much deeper peak than other species, lots of var throughout year - again, 2D model fits marginally better, although neither fit well compared to other sp.

# sqrt_picoeuk 1D gam
gam_sqrt_picoeuk_1 = gam(sqrt_picoeuk ~ s(press) + s(day), data = cyto)

gam.check(gam_sqrt_picoeuk_1) # residuals look good, k = 9

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 7 iterations.
## The RMS GCV score gradient at convergence was 8.951954e-08 .
## The Hessian was positive definite.
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'  edf k-index p-value    
## s(press) 9.00 6.36    0.99     0.3    
## s(day)   9.00 8.53    0.60  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(gam_sqrt_picoeuk_1) # only 48% dev explained
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## sqrt_picoeuk ~ s(press) + s(day)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.093343   0.000728   128.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df      F p-value    
## s(press) 6.359  7.045 137.31  <2e-16 ***
## s(day)   8.527  8.935  14.51  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.473   Deviance explained = 47.9%
## GCV = 0.00065359  Scale est. = 0.00064506  n = 1217
plot(gam_sqrt_picoeuk_1)

# see peak much deeper than other species
# lots of small spikes throughout the year

What are your interpretations of the results so far? - Different species have different abundance patterns across depth and time - pro and hbact peak at intermediate depths, around 75 (m?), picoeuk peaks deeper, at around 100 (m?) - pro appears relatively constant throughout the year, hbact has two distinct abundance peaks throughout the year (~May and September), picoeuk has four abundance peaks throughout the year.

#2.

To fit GAM(s) including all three groups simultaneously you will need to convert the data to ‘long’ format, where there is a column that contains all the concentrations of all three types of microbes, and a second column that codes which microbe was counted in that row, as well as additional columns for the other model predictors. You can convert to long format by hand using a spreadsheet, or you can use a helpful function called pivot_longer() in the package ‘tidyr’.

cyto_long <- cyto %>%
  pivot_longer(
    cols = matches("sqrt_pro|hbact|sqrt_picoeuk"),
    names_to = "Microbe",
    values_to = "Abundance"
  )

cyto_long$Microbe <- as.factor(cyto_long$Microbe)

Now let’s compare the niches of the three groups to each other. Use a GAM including all groups simultaneously to simultaneously test three questions: (a) Do the different kinds of microbes have different mean abundances? - Abundance ~ Microbe, not using s b/c not continuous (b) Do the different kinds of microbes have different average depth distributions (i.e., averaging over time)? - s(Press, by = Microbe) (c) Do the different kinds of microbes have different average seasonal dynamics (i.e., averaging over depths)? - s(Day, by = Microbe)

super_model <- gam(Abundance ~ Microbe +
                 s(press, by = Microbe) +
                 s(day, by = Microbe),
               data = cyto_long)

gam.check(super_model) # residuals do not look good - uh oh!

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 9 iterations.
## The RMS GCV score gradient at convergence was 5.038129e-08 .
## The Hessian was positive definite.
## Model rank =  57 / 57 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                k'  edf k-index p-value    
## s(press):Microbehbact        9.00 8.72    0.89  <2e-16 ***
## s(press):Microbesqrt_picoeuk 9.00 1.54    0.89  <2e-16 ***
## s(press):Microbesqrt_pro     9.00 4.92    0.89  <2e-16 ***
## s(day):Microbehbact          9.00 8.64    0.89  <2e-16 ***
## s(day):Microbesqrt_picoeuk   9.00 1.00    0.89  <2e-16 ***
## s(day):Microbesqrt_pro       9.00 2.46    0.89  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# says k may be too low, b/c close to edf
summary(super_model) 
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Abundance ~ Microbe + s(press, by = Microbe) + s(day, by = Microbe)
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.88022    0.01126   344.5   <2e-16 ***
## Microbesqrt_picoeuk -3.78688    0.01593  -237.7   <2e-16 ***
## Microbesqrt_pro     -2.84263    0.01593  -178.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                edf Ref.df       F p-value    
## s(press):Microbehbact        8.718  8.960 892.621  <2e-16 ***
## s(press):Microbesqrt_picoeuk 1.540  1.896   0.840  0.3440    
## s(press):Microbesqrt_pro     4.916  5.853 304.036  <2e-16 ***
## s(day):Microbehbact          8.641  8.962  40.918  <2e-16 ***
## s(day):Microbesqrt_picoeuk   1.000  1.000   0.248  0.6184    
## s(day):Microbesqrt_pro       2.460  3.055   5.034  0.0017 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.951   Deviance explained = 95.2%
## GCV = 0.15571  Scale est. = 0.15442   n = 3651
# re-run with diff k value, set at 30, tried 20 but k still close to edf
super_model <- gam(Abundance ~ Microbe +
                 s(press, by = Microbe, k = 30) +
                 s(day, by = Microbe, k = 30),
               data = cyto_long)

gam.check(super_model) # still don't like residuals

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 18 iterations.
## The RMS GCV score gradient at convergence was 9.285798e-08 .
## The Hessian was positive definite.
## Model rank =  177 / 177 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                 k'   edf k-index p-value    
## s(press):Microbehbact        29.00 23.07    0.89  <2e-16 ***
## s(press):Microbesqrt_picoeuk 29.00  1.59    0.89  <2e-16 ***
## s(press):Microbesqrt_pro     29.00  4.98    0.89  <2e-16 ***
## s(day):Microbehbact          29.00 28.70    0.90  <2e-16 ***
## s(day):Microbesqrt_picoeuk   29.00  1.00    0.90  <2e-16 ***
## s(day):Microbesqrt_pro       29.00  2.51    0.90  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(super_model) 
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Abundance ~ Microbe + s(press, by = Microbe, k = 30) + s(day, 
##     by = Microbe, k = 30)
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.88022    0.01085   357.6   <2e-16 ***
## Microbesqrt_picoeuk -3.78688    0.01534  -246.8   <2e-16 ***
## Microbesqrt_pro     -2.84263    0.01534  -185.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                 edf Ref.df       F  p-value    
## s(press):Microbehbact        23.068 26.193 330.669  < 2e-16 ***
## s(press):Microbesqrt_picoeuk  1.589  1.964   1.042 0.302888    
## s(press):Microbesqrt_pro      4.978  5.937 323.136  < 2e-16 ***
## s(day):Microbehbact          28.704 28.991  21.362  < 2e-16 ***
## s(day):Microbesqrt_picoeuk    1.000  1.000   0.268 0.604983    
## s(day):Microbesqrt_pro        2.511  3.126   5.325 0.000998 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.955   Deviance explained = 95.6%
## GCV = 0.14586  Scale est. = 0.14327   n = 3651
# sig differences in abundance across three different microbe types (hbact highest)
# hbact, sqrt_pro vary by depth, not sqrt_picoeuk
# hbact, sqrt_pro vary by day, not sqrt_picoeuk

Now that you have fit models to test questions (a)-(c), make appropriate plots and perform appropriate hypothesis tests.

plot(super_model)

How do you intepret the results? - sig differences in abundance across three different microbe types (hbact highest) - hbact, sqrt_pro vary by depth, not sqrt_picoeuk - hbact, sqrt_pro vary by day, not sqrt_picoeuk

looking at plots: - lots of variation across depth for hbact - no variation across depth for picoeuk - pro decreases as depth increases - lots of variation across day for hbact - no variation across day for picoeuk - effectively no variation across day for pro

#3.

Finally, let’s investigate how abundances of the three groups at shallower depths correlate with mixed layer depth (an index of stratification) and chlorophyll a concentration. The second attached file, hot_mlds.csv, contains the average mixed layer depth for each HOT cruise. You’ll need to merge the information in this file with the dataset you have been analyzing. - run join_by on key - I guess crn is cruise? No consistent key - poor data management practices!

# rename crn to cruise
mlds$cruise <- mlds$crn
  
joined_df <- cyto_long %>%
  left_join(mlds, by = 'cruise')

Using only data from the top 45 meters

filter to only retain top 45 m

joined_df <- joined_df %>%
  filter(press <= 45)

hist(joined_df$press)

how does the concentration of each group of microbes vary with (a) mixed layer depth and (b) Chl a concentration? Test whether the three types of microbes exhibit different relationships with the two predictors. - s(mean, by = Microbe) + s(chl, by = Microbe)

compare to model with 1D smoothers - s(mean) + s(chl)

Use appropriate GAM(s), hypothesis tests, and smoother plots to assess these questions.

mlds_2d model

mlds_2d <- gam(Abundance ~ Microbe + s(mean, by = Microbe) +
                 s(chl, by = Microbe),
               data = joined_df)

gam.check(mlds_2d) # don't like residuals at all

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 10 iterations.
## The RMS GCV score gradient at convergence was 2.852528e-08 .
## The Hessian was positive definite.
## Model rank =  57 / 57 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                               k'  edf k-index p-value    
## s(mean):Microbehbact        9.00 8.61    0.93   0.005 ** 
## s(mean):Microbesqrt_picoeuk 9.00 1.00    0.93   0.010 ** 
## s(mean):Microbesqrt_pro     9.00 1.00    0.93   0.010 ** 
## s(chl):Microbehbact         9.00 4.99    0.93   0.020 *  
## s(chl):Microbesqrt_picoeuk  9.00 1.00    0.93  <2e-16 ***
## s(chl):Microbesqrt_pro      9.00 1.36    0.93   0.010 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mlds_2d) # explains 97% deviation
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Abundance ~ Microbe + s(mean, by = Microbe) + s(chl, by = Microbe)
## 
## Parametric coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.78352    0.01998   239.4   <2e-16 ***
## Microbesqrt_picoeuk -4.68602    0.02826  -165.8   <2e-16 ***
## Microbesqrt_pro     -3.36964    0.02826  -119.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                               edf Ref.df      F  p-value    
## s(mean):Microbehbact        8.613  8.955 28.141  < 2e-16 ***
## s(mean):Microbesqrt_picoeuk 1.000  1.000  0.014    0.906    
## s(mean):Microbesqrt_pro     1.000  1.000  1.005    0.316    
## s(chl):Microbehbact         4.987  6.079  6.628 1.41e-06 ***
## s(chl):Microbesqrt_picoeuk  1.000  1.000  0.014    0.905    
## s(chl):Microbesqrt_pro      1.363  1.647  0.755    0.342    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.965   Deviance explained = 96.5%
## GCV = 0.14658  Scale est. = 0.14373   n = 1080
plot(mlds_2d)

# hbact sig variation by mean and chl, not for the others

mlds_1d model

mlds_1d <- gam(Abundance ~ s(mean) + s(chl),
               data = joined_df)

gam.check(mlds_1d) # residuals look horrible

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 11 iterations.
## The RMS GCV score gradient at convergence was 2.901478e-07 .
## The Hessian was positive definite.
## Model rank =  19 / 19 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##         k' edf k-index p-value
## s(mean)  9   1    1.48       1
## s(chl)   9   1    1.48       1
summary(mlds_1d)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Abundance ~ s(mean) + s(chl)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.09830    0.06146   34.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df     F p-value
## s(mean)   1      1 1.971   0.161
## s(chl)    1      1 0.351   0.554
## 
## R-sq.(adj) =  8.07e-06   Deviance explained = 0.186%
## GCV = 4.0913  Scale est. = 4.0799    n = 1080
plot(mlds_1d)

# no sig relationships between abundance across species and depth, chlorophyll

from 2d model - explains 97% variation - hbact varies by mixed layer depth (several peaks) and chlorophyll a concentration (increases) - neither sqrt_pro or sqrt_picoeuk vary by either depth or chlorophyll a

from 1d model - wow! only explains .19% of variation - this model fits horribly - don’t find any relationship between abundance and either depth or chlorophyll

So yes, the concentration varies of different microbe types varies differently according to depth and chlorophyll concentration - this is driven by hbact, which significantly varies by both variables - the model accounting for differences across microbe types was amazing (~98% R2), whereas the model that didn’t allow for different microbes to respond differently was horrible (<1% R2)